    A skewness-aware matrix factorization approach for mesh-structured cloud services

    Get PDF
    Online cloud services need to fulfill clients' requests scalably and quickly. State-of-the-art cloud services are increasingly deployed as a distributed service mesh, in which service-to-service communication is frequent. Unfortunately, problematic events may occur between any pair of nodes in the mesh, so it is vital to maximize network visibility. A state-of-the-art approach is to model pairwise RTTs with a latent factor model represented as a low-rank matrix factorization. A latent factor corresponds to a rank-one component of the factorization model and is shared by all node pairs. However, different node pairs usually experience a skewed set of hidden factors, which should be fully considered in the model. In this paper, we propose a skewness-aware matrix factorization method named SMF. We decompose the matrix factorization into basic units of rank-one latent factors and progressively combine rank-one factors for different node pairs. We present a unifying framework that automatically and adaptively selects the rank-one factors for each node pair, which not only preserves the low rankness of the matrix model but also adapts to skewed network latency distributions. On real-world RTT data sets, SMF improves the relative error by a factor of 0.2× to 10×, converges quickly and stably, and compactly captures fine-grained local and global network latency structures.
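
    To make the rank-one decomposition concrete, here is a minimal CUDA sketch of the prediction step under illustrative assumptions: factor matrices U and V, a per-pair bitmask (mask) encoding which rank-one factors are selected, and the predicted RTT for a pair (i, j) taken as the sum of the selected components U[i,k]*V[j,k]. The names and the bitmask encoding are hypothetical, not the paper's actual formulation.

    // Hedged sketch: combining per-pair selections of rank-one latent
    // factors, loosely following the SMF idea above. U, V, mask, and the
    // bitmask encoding are illustrative assumptions.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define R 8  // number of rank-one latent factors (illustrative)

    // pred[p] = sum over selected k of U[i*R+k] * V[j*R+k]
    __global__ void predict_rtt(const float* U, const float* V,
                                const int* src, const int* dst,
                                const unsigned* mask,  // per-pair factor selection
                                float* pred, int numPairs) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numPairs) return;
        int i = src[p], j = dst[p];
        float s = 0.0f;
        for (int k = 0; k < R; ++k)
            if (mask[p] & (1u << k))  // rank-one factor k active for this pair
                s += U[i * R + k] * V[j * R + k];
        pred[p] = s;
    }

    int main() {
        const int n = 4, numPairs = 2;
        float *U, *V, *pred; int *src, *dst; unsigned *mask;
        cudaMallocManaged(&U, n * R * sizeof(float));
        cudaMallocManaged(&V, n * R * sizeof(float));
        cudaMallocManaged(&pred, numPairs * sizeof(float));
        cudaMallocManaged(&src, numPairs * sizeof(int));
        cudaMallocManaged(&dst, numPairs * sizeof(int));
        cudaMallocManaged(&mask, numPairs * sizeof(unsigned));
        for (int x = 0; x < n * R; ++x) { U[x] = 0.1f; V[x] = 0.2f; }
        src[0] = 0; dst[0] = 1; mask[0] = 0x3u;   // pair (0,1): factors 0 and 1 only
        src[1] = 2; dst[1] = 3; mask[1] = 0xFFu;  // pair (2,3): all 8 factors
        predict_rtt<<<1, 32>>>(U, V, src, dst, mask, pred, numPairs);
        cudaDeviceSynchronize();
        printf("pred(0,1) = %.3f, pred(2,3) = %.3f\n", pred[0], pred[1]);
        return 0;
    }

    Selecting more rank-one factors for pairs with heavier-tailed latencies, and fewer for well-behaved pairs, is the kind of per-pair adaptivity the abstract describes.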

    On the Transformation Optimization for Stencil Computation

    No full text
    Stencil computation optimizations have been investigated extensively, and various approaches have been proposed. Loop transformation is a vital class of optimization in modern production compilers and has been employed within them with proven success. In this paper, we combine the two lines of work to study the potential benefits that common transformation recipes offer for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balancing, and a forward-and-backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, covering 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms demonstrate the respective effects of the transformation recipes. For the single transformation recipes we analyze, the average speedup is 1.65× and the best is 1.88×; the compound recipes reach a maximum speedup of 1.92×.
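
    As a concrete illustration of one recipe, the host-side sketch below (plain C++; compiles with nvcc or any C++ compiler) applies factor-4 loop unrolling to a 1D 3-point stencil. The unroll factor and the stencil coefficients are illustrative choices, not the paper's exact setup.

    // Baseline: one output point per loop iteration.
    void stencil_1d(const float* in, float* out, int n) {
        for (int i = 1; i < n - 1; ++i)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }

    // Unrolled by 4: fewer loop-control instructions and more
    // instruction-level parallelism; a remainder loop handles the tail
    // when the interior size is not a multiple of 4.
    void stencil_1d_unroll4(const float* in, float* out, int n) {
        int i = 1;
        for (; i + 3 < n - 1; i += 4) {
            out[i]     = 0.25f * in[i - 1] + 0.5f * in[i]     + 0.25f * in[i + 1];
            out[i + 1] = 0.25f * in[i]     + 0.5f * in[i + 1] + 0.25f * in[i + 2];
            out[i + 2] = 0.25f * in[i + 1] + 0.5f * in[i + 2] + 0.25f * in[i + 3];
            out[i + 3] = 0.25f * in[i + 2] + 0.5f * in[i + 3] + 0.25f * in[i + 4];
        }
        for (; i < n - 1; ++i)  // remainder
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }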

    Performance of sediment transport simulations on NVIDIA’s Kepler architecture

    Get PDF
    Aiming to understand how high-performance CUDA programming can be done for NVIDIA's Kepler architecture, we have investigated a specific case of simulating sediment transport. The stencil computations that arise have distinct features connected to the two nonlinear partial differential equations that constitute the mathematical model. Consequently, the required CUDA programming effort differs between the two corresponding CUDA kernel functions. While Kepler's new read-only data cache brings sufficient benefits for one kernel function, the performance of the other can be further enhanced by using shared memory and so-called halo threads. The highest achieved performance of the stencil computation is 190.45 GFLOP/s on a Tesla K20 GPU.
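
    The sketch below shows the two techniques on a generic 1D 3-point stencil rather than the paper's sediment-transport kernels: one variant routes loads through Kepler's read-only data cache, the other stages a tile plus its halo in shared memory. Coefficients and tiling choices are illustrative assumptions.

    // Variant 1: read-only data cache. Marking the input
    // 'const float* __restrict__' lets Kepler route its loads through the
    // read-only cache, equivalent to explicit __ldg() calls.
    __global__ void stencil_ro(const float* __restrict__ in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i < n - 1)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }

    // Variant 2: stage a tile plus its two halo cells in shared memory. Here
    // the boundary threads of each block load the halo; the "halo threads"
    // idea in the text instead dedicates extra threads of the block to those
    // loads. Launch with (blockDim.x + 2) * sizeof(float) dynamic shared memory.
    __global__ void stencil_sm(const float* in, float* out, int n) {
        extern __shared__ float tile[];                 // blockDim.x + 2 floats
        int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int l = threadIdx.x + 1;                        // local index, halo offset
        if (g < n) tile[l] = in[g];
        if (threadIdx.x == 0 && g >= 1)                  tile[0] = in[g - 1];
        if (threadIdx.x == blockDim.x - 1 && g + 1 < n)  tile[l + 1] = in[g + 1];
        __syncthreads();
        if (g >= 1 && g < n - 1)
            out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
    }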

    HPGraph: High-Performance Graph Analytics with Productivity on the GPU

    No full text
    The growing use of graphs in many fields has sparked broad interest in developing high-level graph analytics programs. Existing GPU implementations achieve limited performance or compromise on productivity. HPGraph, our high-performance bulk-synchronous graph analytics framework for the GPU, provides an abstraction that maps vertex programs to generalized sparse matrix operations on the GPU backend. HPGraph strikes a balance between performance and productivity by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that lets users implement various graph algorithms with relatively little effort. We evaluate the performance of HPGraph on four graph primitives (BFS, SSSP, PageRank, and TC). Our experiments show that HPGraph matches or exceeds the performance of high-performance GPU graph libraries such as MapGraph, nvGraph, and Gunrock, and runs significantly faster than advanced CPU graph libraries.
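
    To illustrate the vertex-program-to-sparse-matrix mapping in general terms (this is not HPGraph's actual API), the kernel below advances one BFS level: expanding the frontier against a CSR adjacency matrix corresponds to a masked sparse matrix-vector product on the boolean semiring.

    // Hedged sketch: one BFS level as a masked sparse operation over a CSR
    // adjacency matrix; a generic illustration, not HPGraph's API.
    __global__ void bfs_level(const int* rowPtr, const int* colIdx,  // CSR graph
                              const int* frontier, int frontierSize,
                              int* levels,        // -1 marks unvisited (the mask)
                              int curLevel,
                              int* nextFrontier, int* nextSize) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontierSize) return;
        int u = frontier[t];
        for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
            int v = colIdx[e];
            // Claim each unvisited neighbour exactly once.
            if (atomicCAS(&levels[v], -1, curLevel + 1) == -1)
                nextFrontier[atomicAdd(nextSize, 1)] = v;
        }
    }

    // Host side: initialize levels[] to -1 (0 for the source), seed the
    // frontier with the source vertex, and launch bfs_level once per level
    // until the new frontier comes back empty.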

    Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

    No full text
    By reorganizing the execution order and optimizing the data structures, we propose an efficient parallel framework for the H.264/AVC encoder on massively parallel architectures, implemented in CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy that enhances the parallelism of intra coding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the H.264 encoder's workload is offloaded to the GPU. Experimental results show that the parallel implementation achieves a speedup of more than 20× over the serial program and satisfies the requirement of real-time HD encoding at 30 fps; the PSNR loss ranges from 0.14 dB to 0.77 dB at the same bitrate. Analysis of the kernels shows that the speedup of the compute-intensive algorithms is proportional to the computational power of the GPU, whereas the performance of the control-intensive parts (CAVLC) is strongly tied to memory bandwidth, which offers insight for new architecture designs.
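
    A pattern behind several of these components (intra prediction, deblocking) is wavefront parallelism over macroblocks. The sketch below is a generic, hedged version of that idea rather than the paper's implementation; process_mb() is a hypothetical stand-in for the real per-macroblock work.

    // Wavefront-style macroblock parallelism: a macroblock at (x, y) depends
    // on its left and top neighbours, so all macroblocks on one anti-diagonal
    // are independent and can be processed by a single kernel launch.
    __device__ void process_mb(int x, int y) { /* intra predict / filter */ }

    __global__ void wavefront_step(int diag, int mbWidth, int mbHeight) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = diag - x;                   // stay on anti-diagonal x + y == diag
        if (x < mbWidth && y >= 0 && y < mbHeight)
            process_mb(x, y);
    }

    // Host side: one launch per diagonal preserves the dependency order:
    //   for (int d = 0; d < mbWidth + mbHeight - 1; ++d)
    //       wavefront_step<<<blocks, threads>>>(d, mbWidth, mbHeight);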
